Adding a zero-crossing count to spectral information
in  template-based  speech  recognition 


Zero-crossing  data  can  provide  important  feature  information  about  an  utterance 
which  is  not  available  in  a  purely  spectral  representation.  This  report  describes  the 
incorporation  of  zero-crossing  information  into  the  spectral  representation  used  in  a 
template-matching  system  (CICADA).  An  analysis  of  zero-crossing  data  for  an  extensive 
(2880  utterance,  8  talker)  alpha-digit  data  base  is  described.  On  the  basis  of  this  analysis, 
a  zero-crossing  algorithm  is  proposed.  The  algorithm  was  evaluated  using  a  confusible 
subset  of  the  alpha-digit  vocabulary  (the  "E-set").  Inclusion  of  zero-crossing  information 
in  the  representation  leads  to  a  10-13%  reduction  in  error  rate,  depending  on  the  spectral 
representation.

1 Introduction¹

An  important  consideration  in  the  design  of  speech  recognition  systems  is  the  choice  of  an 
accurate  yet  economical  representation  for  the  speech  signal  [Davis  80,  White  76].  Most  systems  use 
a  compact  encoding  of  the  short-term  spectrum  such  as  LPC  or  coefficients  derived  from  band-pass 
filtering.  Necessarily,  a  great  deal  of  information  (such  as  spectral  detail  and  temporal  structure)  is 
lost  in  the  encoding.  For  vocabularies  containing  words  whose  spectra  are  highly  distinct  (e.g.,  the 
digits)  such  a  representation  is  adequate  for  high-accuracy  recognition.  In  other  cases,  the  coarseness 
of  the  spectral  mapping  leads  to  difficulties  in  discrimination.  A  subset  of  the  alpha-digit 
vocabulary, words ending in the vowel /i/, illustrates such difficulties.² Utterances in the /i/ set are
confusable because their distinctive characteristics are restricted almost entirely to a short segment at
the beginning of the utterance (the consonant). It is in the nature of template matching to give equal
weight to all portions of an utterance; as a result, the contribution of the initial segment to the total
distance between two utterances is frequently outweighed by random variations in the remainder of
the  utterance  (the  vowel).  The  goal  of  the  work  reported  in  this  paper  is  to  explore  techniques  that 
enhance  the  contribution  of  phonetically  significant  portions  of  an  utterance  while  preserving  a 
representation  that  allows  a  uniform  template  matching  procedure  to  be  used.  The  work  described 
in this paper was done using the CICADA system developed at Carnegie-Mellon University [Alieva
81, Waibel 80]. CICADA uses a representation based on a compression of the short-term spectrum
according to a 16-coefficient mel scale.

Let us consider the CICADA representation in more detail. In addition to a loss of fine spectral
detail, two major features of the speech signal are lost in the mel-scale compression: the pitch of the
vocalic portions and the distinction between a periodic and an aperiodic signal. Although the
contribution of pitch information to phonetic identity is not well understood (see, however, [Massaro
78]), the latter distinction provides information that can be used to discriminate otherwise
confusable utterances, such as "C"-"Z" and "T"-"D".

A purely spectral representation, particularly the kind used in the CICADA system, has at best only
ambiguous information about the excitation source for a given speech segment. In this paper we
describe an investigation of the potential advantages of including information about excitation
source as a supplement to the spectral representation. We chose to investigate the zero-crossing
count as a source of such information because of its straightforward derivation and its familiarity in
the field (see, e.g., [Baker 74]).

¹ NOTE: Some of the data described in this paper were reported earlier at the 102nd Meeting of the Acoustical Society of
America, December, 1981.

² The 10-member confusable set is composed of the letters "B", "C", "D", "E", "G", "P", "T", "V", "Z", and the digit "3".

2  An  analysis  of  zero-crossing  statistics 

The  present  section  introduces  the  count  method,  describes  a  number  of  statistics  for  our  speech 
corpus,  and  examines  the  use  of  statistical  data  as  a  basis  for  recognition  decisions. 

Zero-crossing statistics were collected for all utterances in our database³ (a total of 2880
utterances).  The  zero-crossing  count  was  calculated  using  the  same  time-frame  parameters  used  for 
the  calculation  of  spectral  coefficients;  that  is,  over  a  20  msec  window  stepped  10  msec  through  the 
utterance.  The  zero-crossing  count  was  calculated  using  non-preemphasized  speech.  The  potential 
range for the resulting zero-crossing count was thus 0-200; in actuality the observed range was 2-164.
A  number  of  alternate  counting  algorithms  could  have  been  used,  most  notably,  calculating  the 
zero-crossing  count  for  only  the  central  10  msec  of  each  spectral  frame.  Since  the  waveform  is 
Hamming-windowed  before  spectral  analysis,  the  central  10  msec  would  correspond  to  the  region 
that  makes  the  major  contribution  to  the  frame  spectrum.  On  the  other  hand,  a  10  msec  window  can 
give  an  unstable  estimate  of  the  zero-crossing  count,  thus  introducing  extra  noise  into  the  parameter. 
Empirical tests seem to indicate that the 20 msec window results in better performance (see section
3). 
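The frame-based counting described above can be sketched as follows. This is only an illustrative sketch, not the CICADA implementation: the function name, the treatment of a zero-valued sample as non-negative, and the restriction to transitions inside a frame (at most 199 for a 200-sample frame) are assumptions made here. The 10 kHz sampling rate, 20 msec window and 10 msec step are taken from the text.

```python
def zero_crossing_counts(samples, sample_rate=10_000, window_ms=20, step_ms=10):
    """Count sign changes between successive samples in each analysis frame.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    window = sample_rate * window_ms // 1000   # 200 samples at 10 kHz
    step = sample_rate * step_ms // 1000       # 100 samples at 10 kHz
    counts = []
    for start in range(0, len(samples) - window + 1, step):
        frame = samples[start:start + window]
        # A crossing is a pair of successive samples with different signs.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        counts.append(crossings)
    return counts
```

A signal that alternates in sign on every sample yields the maximum count for each frame, while a constant signal yields zero.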

Zero-crossing  counts  were  collected  for  the  utterance  as  bounded  by  the  begin  and  end  points 
determined  by  an  automatic  begin-end  detector  [Yegna  79].  The  following  statistics  were  calculated: 
the  mean,  the  standard  deviation,  the  median,  and  the  range.  In  addition,  the  median  zero-crossing 
count  in  ten  equally-spaced  intervals  within  an  utterance  was  calculated. 
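As an illustration, the per-utterance statistics listed above might be computed along the following lines. The function name and the dictionary layout are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def zc_statistics(counts, n_intervals=10):
    """Summarize the per-frame zero-crossing counts of one utterance."""
    if not counts:
        raise ValueError("counts must be non-empty")
    size = len(counts) / n_intervals
    interval_medians = []
    for i in range(n_intervals):
        lo = int(i * size)
        # Guarantee at least one frame per interval for short utterances.
        hi = max(int((i + 1) * size), lo + 1)
        interval_medians.append(statistics.median(counts[lo:hi]))
    return {
        "mean": statistics.mean(counts),
        "sd": statistics.pstdev(counts),        # assumption: population SD
        "median": statistics.median(counts),
        "range": (min(counts), max(counts)),
        "interval_medians": interval_medians,
    }
```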

Over the 8-talker database we observe a 2-164 range of zero-crossing counts. The highest mean
and median zero-crossing counts are found for the utterance "SIX". This is to be expected, since
"SIX" contains proportionately the most frication in the alpha-digit vocabulary. Looking at the
standard deviations (SDs), we see, again as expected, that utterances containing fricatives have high
standard deviations. If we consider the absolute difference between the mean and the median to be
a rough estimate of the skewness of a particular distribution, we note that pairs of utterances
differing primarily in the degree of frication present (for example, C-Z, T-D, and P-B) show
differences in the skewness of their zero-crossing count distributions: utterances with frication have
more skewed distributions. Differences between utterances become even more apparent if we
restrict the scope of our measurements to the initial portion of such utterances.

³ The database consists of 10 tokens of each word in the 36-member alpha-digit vocabulary (A..Z, 0..9) recorded by 8 talkers
(4 male, 4 female). All talkers were "naive". The material was recorded on audio tape in a moderately noisy ("office")
environment, then digitized using a 10 kHz sampling rate.


2.1  Using  an  utterance’s  zero-crossing  count  characteristics 

The  data  described  above  suggest  that  we  might  be  able  to  use  the  behaviour  of  the  zero-crossing 
count  in  an  utterance  to  supplement  spectral  matching.  Unfortunately,  these  differences  cannot  be 
directly  translated  into  reliable  tests  that  distinguish  pairs  of  spectrally  similar  utterances  (for 
example, P-B). To show that this is the case, we will examine several specific instances.


Table  1  shows  the  range  of  standard  deviations  (SD)  found  for  the  minimal  pairs  B-P  and  D-T.  If 
SD  is  to  be  used  to  reliably  discriminate  between  the  members  of  the  two  pairs,  then  the  SD  ranges 
for  the  two  members  of  each  pair  must  not  overlap.  As  can  be  seen  from  the  Table,  this  is  the  case 
only  half  the  time.  Several  talkers  (ds,  gg,  rp)  show  consistent  discrimination  for  both  pairs.  For  the 
other  talkers,  however,  the  separation  is  absent  or  is  inconsistent  across  the  two  pairs  examined.  The 
same  pattern  is  apparent  for  the  skewness  measure  (Table  2):  The  members  of  a  pair  can  be 
consistently  distinguished  for  five  talkers,  inconsistently  for  two  more  and  not  at  all  for  the 
remaining  talkers.  Focussing  on  those  parts  of  the  utterance  known  (from  phonetic  analysis)  to  carry 
the  discriminative  information,  in  this  case  the  beginning  of  the  utterance,  does  not  allow  us  to 
realise  an  increase  in  accuracy  or  consistency  (Table  3). 
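The non-overlap criterion used in this analysis can be stated compactly in code. The sketch below is illustrative only: the function names are hypothetical, and the talker labels and values in the usage example are invented and are not data from the tables.

```python
def ranges_overlap(range_a, range_b):
    """True if two closed (low, high) ranges share at least one value."""
    (lo_a, hi_a), (lo_b, hi_b) = range_a, range_b
    return lo_a <= hi_b and lo_b <= hi_a


def separable_talkers(ranges_by_talker):
    """Talkers for whom the statistic separates the two members of a pair.

    ranges_by_talker maps a talker label to a pair of (low, high) ranges,
    one for each member of the pair (e.g. the SD ranges for "B" and "P").
    """
    return sorted(
        talker
        for talker, (first, second) in ranges_by_talker.items()
        if not ranges_overlap(first, second)
    )
```

For example, with the invented input `{"t1": ((2.0, 5.0), (6.0, 9.0)), "t2": ((2.0, 7.0), (6.0, 9.0))}` only `t1` is returned, because the two ranges for `t2` overlap and the statistic cannot discriminate the pair for that talker.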

Table 1: The use of variance information for voicing discrimination

Total correct classifications: 3 S

Note: values are the range of variances, calculated over 10 utterances for each talker.


Table 2: Mean-median disparity

Table 3: Zero crossing count range for voiced and unvoiced pairs

voiced-unvoiced pairs  non-overlaps
39
33
37
59
0
L5
5
1

Note: The entries in this table indicate the relevant extrema of the range distribution. Thus, the entries for the
unvoiced member of each pair indicate the lowest zero-crossing count observed for that utterance, while the entries for the
voiced member indicate the highest zero-crossing count for that utterance. Obviously, if the lowest unvoiced count is less than the
highest voiced count, zero-crossing count is not diagnostic of voicing. Each value is based on ten (10) instances.

This short exercise leads to the conclusion that zero-crossing information is not consistently useful
over all talkers or over all utterances. It is the case, however, that such information is consistent for
some  talkers  and  for  some  distinctions.  Use  of  zero-crossing  information,  therefore,  requires  the 
presence  of  additional  knowledge,  such  as  might  be  obtained  through  tuning  to  a  particular  talker  or 
through a recognition procedure that has the capability to narrow down choices to small sets for
which the zero-crossing (or any other) attributes are well understood (see [Cole 81] for an example of
this strategy). As an example of the latter, if the choices for an utterance can be narrowed down to
"T-D", then the SD information can be used to pick either one word or the other. Likewise, if the
choice can be narrowed down to "P-B", then the (raw) zero-crossing count for the initial portion of
the utterance can be used as a source of evidence. The implication is that given the imperfect
reliability  of  zero-crossing  information,  it  is  best  used  in  the  context  of  a  recognition  strategy  that 
incorporates  detailed  knowledge  of  the  characteristics  of  speech  sounds  and  uses  an  informed 
sequential  decision  strategy.  While  the  C-MU  speech  group  is  actively  pursuing  work  on  such 
systems,  the  focus  of  the  present  paper  is  on  the  enhancement  of  template  matching  techniques. 

One  of  the  attractions  of  dynamic  time  warping  is  that  it  is  a  general  decision  procedure  and 
makes  no  use  of  domain  specific  information.  This  allows  the  use  of  highly  efficient  search 
procedures,  but  requires  that  adequate  discriminative  information  be  included  in  the  representations 
of  the  events  being  matched.  The  question  therefore  is  whether  a  parameter  such  as  the  zero- 
crossing  count  can  selectively  enhance  discriminative  information  contained  in  the  representation. 

Concretely, we are interested in determining the optimal manner in which to extract and
transform  zero-crossing  information  and  include  it  as  a  part  of  a  spectral  template. 

3  Designing  an  optimal  zero-crossing  function 

The  ideal  zero-crossing  coefficient  would  take  on  a  neutral  value  when  the  speech  signal  was 
either  silence  or  vocalic  and  go  to  one  of  several  levels  when  different  kinds  of  aperiodic  energy  were 
present in the speech. Such a function is in practice difficult to design; however, a reasonable
approximation  can  be  achieved.  To  do  so,  we  must  be  able  to  define  the  following  parameters: 

•  a  floor  value:  a  count  below  which  it  can  be  assumed  no  aperiodic  energy  of  interest  is 
present, i.e., during vocalic portions or during silence.

•  a  ceiling  value:  a  count  above  which  differentiating  different  degrees  of  zero-crossing 
counts  is  not  informative. 

•  quantization:  a  mapping  of  the  range  between  the  ceiling  and  the  floor  which  conveys 
useful  information  about  the  nature  of  the  speech  signal,  e.g.,  providing  distinctions 
between  voiced  and  unvoiced  fricatives. 

The experiments described below represent a systematic, though certainly not exhaustive, search for
optimal  settings  of  the  above  parameters. 

3.1  General  procedure  for  all  experiments 

Except in the case of some pilot experiments (summarized in section 3.5), all experiments were
performed using the full set of utterances (800) in the confusable "E" set. All matches were
performed within talkers, using each of the ten repetitions of a vocabulary item in turn as a reference
template. Thus, a total of 900 matches were performed for each talker data set, or a total of 7200
matches for each condition in the experiments described below. All experiments were performed
using the CICADA2 system [Alieva 81].

3.2  The  basic  zero-crossing  algorithm 

A  zero-crossing  is  defined  as  a  transition  between  two  successive  waveform  samples  that  produces 
a change in sign. The number of such transitions per frame is taken as the raw zero-crossing count
for that frame. Except in the case of a few experiments (see section 3.5), the raw zero-crossing count
is transformed by the following equation:

ZCcoefficient = (RawZC - Floor) / RangeFactor

The Floor and RangeFactor parameters correspond to the parameters described at the
beginning of this section. Together they define a floor (explicitly) and a ceiling (implicitly) for the
function. Values which fall outside this range are automatically clipped to the boundary values. In
actual use, ZCcoefficient is normalized to a [-7..+7] range, corresponding to a 4 bit code. This
feature is useful for our particular representation, as the spectral coefficients are also in this range.
The contribution of ZCcoefficient to the calculation of a distance between two frames within
the warping algorithm is equivalent to that of a single spectral coefficient, or one sixteenth.
Differential weighting of ZCcoefficient is discussed in section 3.5.
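A minimal sketch of this transformation is given below. The text specifies the subtraction of the floor, the division by the range factor, the clipping, and the final [-7..+7] range; the use of integer (floor) division, the 15 discrete levels numbered 0 to 14, and the shift by 7 are assumptions made for illustration, as is the function name.

```python
def zc_coefficient(raw_zc, floor=20, range_factor=4, levels=15):
    """Map a raw zero-crossing count to a clipped, quantized coefficient.

    Assumptions: integer division, levels numbered 0..levels-1, and a
    shift that centres the levels on zero (so 15 levels give -7..+7).
    """
    level = (raw_zc - floor) // range_factor
    level = max(0, min(levels - 1, level))   # clip to floor and ceiling
    return level - levels // 2
```

Under these assumptions, with Floor = 20 and RangeFactor = 4, every count at or below 20 maps to -7 and every count at or above 76 maps to +7, so that silence and vocalic frames share a single neutral value.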

3.3  Floor  value 

For  our  present  purposes  we  will  assume  that  the  kind  of  information  provided  by  the  zero 
crossing  count  is  rather  limited  in  scope  (but  see  [Baker  74]).  That  is,  it  can  signal  the  presence  of 
aperiodic  speech  energy;  perhaps  differentiate  location  and  aspiration,  but  no  more.  Zero-crossings, 
however,  are  always  present  in  recorded  speech:  During  periodic  portions,  during  nominal  silences, 
as  well  as  during  aperiodic  portions.  To  make  the  count  more  sensitive,  it  is  desirable  to  eliminate 
the  "noise"  introduced  by  zero  crossings  during  silences  and  vocalic  speech.  One  common 
technique is to calculate zero-crossing counts on the basis of a center-clipped signal. Although
center-clipping  has  a  number  of  advantages,  we  decided  against  using  it  because  of  its  apparent 
sensitivity  to  changes  in  signal  amplitude.  In  fact,  one  of  our  design  considerations,  being  able  to 
treat  vocalic  and  silent  segments  equivalently,  made  the  use  of  a  "floor"  threshold  desirable. 

A floor parameter is specified by selecting a threshold value for zero-crossing counts such that if
the zero-crossing count falls below this value, it is assumed that no aperiodic speech energy is
present. To establish the level of "background zero-crossing" for voiced speech we can examine a
vocalic utterance such as "ONE" and note the zero-crossing values encountered. A strict criterion
would involve setting the threshold to the maximum zero-crossing value found in the all-voiced
utterance. Although it would not be possible to guarantee that any value above this threshold
indicates the presence of aperiodic energy (consider, for example, the interpolation of some
environmental noise), the likelihood of this being the case would be very high. Since a DC offset or
the presence of voicing might alter the zero-crossing count for a segment that otherwise would be
unambiguously identified as frication, it might be desirable to set a laxer criterion, preemphasize the
signal, or even use the raw zero-crossing count.
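The strict criterion described above amounts to taking the maximum count observed over all-voiced utterances. The sketch below illustrates this; the `quantile` parameter is a hypothetical way of expressing a laxer criterion and is not taken from the text, which leaves the form of such a criterion open.

```python
def floor_from_vocalic(utterance_counts, quantile=1.0):
    """Estimate a floor value from per-frame counts of all-voiced utterances.

    quantile=1.0 is the strict criterion (the maximum observed count).
    Lower values are a hypothetical laxer criterion that ignores the
    highest counts.
    """
    pooled = sorted(c for counts in utterance_counts for c in counts)
    if not pooled:
        raise ValueError("at least one frame count is required")
    index = int(quantile * (len(pooled) - 1))
    return pooled[index]
```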

The  purpose  of  the  present  experiment  is  to  establish  an  optimal  value  for  this  threshold,  or 
Floor  value.  To  simplify  the  design,  the  RangeFactor  parameter  was  set  to  a  constant  value  (4) 
that  would  ensure  that  most  of  the  zero-crossing  range  was  included,  with  minimal  clipping  at  the 
ceiling  value. 


Figure 1: Performance (% error) as a function of floor value.


The experiment was performed using two spectral mapping scales: a 16-coefficient mel scale and a
16-coefficient scale based on the French and Steinberg [French 47] equi-intelligibility scale; this
latter scale will be referred to as the FreSt scale [Rudnicky 82]. The results are presented graphically
in Figure 1. The optimal floor value for the mel scale is about 30, while the optimal value for the
FreSt scale is about 20. The minima of the two functions are rather shallow and thus these values
are approximate. There does not appear to be any ready explanation for the difference in minima
for the two scales.

The present experiment establishes that the presence of a threshold (Floor) increases the
accuracy of template matching. We believe that it does so by reducing the variance of the zero-crossing
count in those parts of an utterance for which this information is not discriminative (i.e.,
vocalic segments). The reduction in variance contributes to a less-noisy match between the
template and the test utterance.

3.4  The  range  factor 

Zero-crossing counts have an inherent instability (as do all acoustic parameters of speech); it is
therefore  desirable  to  reduce  their  variability.  This  can  be  done  by  quantizing  the  range  of  the 
function  and  producing  a  smaller  number  of  (discrete)  levels.  The  present  experiment  compares 
several  degrees  of  quantization  by  altering  the  RangeFactor  parameter  in  the  zero-crossing 
algorithm  (see  section  3.2).  Table  4  shows  the  results  of  varying  the  range  factor  for  two  different 
floor  values.  Note  that  variations  in  the  range  factor  appear  to  have  a  minimal  effect  on 
performance  —  all  values  obtained  are  within  0.5%  of  each  other. 

Table 4: Range Factor experiment: Error Rates (%)

Range factor   Floor = 40   Floor = 20
     3             -          24.83
     4           25.18        24.38
     5           25.24        24.38
     6           25.33        24.57
     7           25.69          -

Taken together with the Floor experiment, these data indicate that the full range of zero-crossing
information (i.e., 2-164) is not necessary for effective use of this parameter. Thus, the
(clipped) range between 20-65, divided into 15 levels, can give satisfactory performance. The
interpretation of this result appears straightforward: Low zero-crossing counts (i.e., below 20) are
likely to come from the vowel portion of an utterance; since all vowels in the set studied are the
same, the zero-crossing fine-structure of the vowels does not contain useful information. (Although
this may not be an appropriate conclusion, given that a constant vowel environment was used.)
Similarly, zero-crossing counts of over 65 are almost certainly taken from aperiodic portions of an
utterance; knowing an exact count above that ceiling does not provide any additional information
and only contributes unnecessary variance. The range between the ceiling and floor values,
however, may contain useful information about the degree of frication present in the signal. The
results of the RangeFactor experiment suggest that, again, the fine-structure of the intermediate
range contributes little specific information, the useful information being the fact that it is an
intermediate range.

In  order  to  understand  the  role  of  fine  structure  in  the  intermediate  range,  an  additional 
experiment  was  performed,  quantizing  the  intermediate  range  to  successively  coarser  levels  (from 
the  original  15).  Two  levels  were  examined:  8  and  3.  The  mel  scale  spectral  representation  was  used 
for  this  experiment.  The  Floor  and  RangeFactor  parameters  were  set  to  20  and  4,  respectively. 
The  results  of  this  experiment  are  shown  in  Table  5. 

Table  5:  Range  quantization 


Number  of  levels  %  error 

3  28.77 

8  25.55 

15  25.22 


As  can  be  seen,  the  reduced  number  of  levels  leads  to  poorer  performance.  This  result  can  be 
interpreted  in  one  of  two  ways:  Either  the  proportion  of  zero  crossings  present  provides  useful 
information  and  the  coarse  quantization  destroys  this  information,  or  a  gradual  shift  from  one 
category  to  the  other  allows  the  recognition  process  to  recover  from  errors  of  categorization.  We 
believe that the latter is the case. A definitive assessment of this question, however, is beyond the
scope  of  this  paper. 
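The coarser quantization examined in this section could be realized in several ways; the text does not say how the 15-level code was reduced to 8 or 3 levels. The sketch below shows one possibility, under the assumptions that the 15 levels are pooled into equal-width bins and that each bin is represented by a value spread evenly over the original [-7..+7] range. The function name and both of these choices are hypothetical.

```python
def requantize(coefficient, n_levels, full_levels=15):
    """Collapse a coefficient in [-7..+7] onto a smaller number of levels.

    Assumptions: equal-width pooling of the original levels, and bin
    representatives spread evenly between the original extremes.
    """
    if n_levels < 2:
        raise ValueError("n_levels must be at least 2")
    half = full_levels // 2
    level = coefficient + half                     # 0 .. full_levels - 1
    bin_index = level * n_levels // full_levels    # 0 .. n_levels - 1
    return round(-half + bin_index * (2 * half) / (n_levels - 1))
```

With three levels, every coefficient becomes -7, 0 or +7; with fifteen levels the function returns its input unchanged.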

3.5  Miscellaneous  factors 

This section describes several additional manipulations that were considered, but were not
pursued. In all cases, the data were obtained for only three talkers (ds, fa, gg). These talkers were
chosen to be representative of the entire 8-talker set. The results are displayed in Table 6. The 100
msec non-overlapped window condition uses an algorithm originally developed by [Niimi 80]. A
number of findings are apparent: using the raw zero-crossing count degrades recognition
performance, presumably because the inherent variability of the zero-crossing count adds more
noise than useful information. Doubling the weight also produces a decrement; this latter result is
consistent with the results of the quantization experiment reported in section 3.4.


Table  6:  Results  of  miscellaneous  experiments 


Manipulation                                                                  % error

mel scale only                                                                27.81
mel scale; raw zero-crossing; 10 msec non-overlapped window                   30.41
mel scale; raw zero-crossing; 20 msec overlapped window; value divided by 2   31.18
FreSt scale only                                                              28.01
FreSt scale; raw zero-crossing; 20 msec overlapped window                     39.56
FreSt scale; floor = 30; ranged to [-14..+14] instead of [-7..+7]
  (i.e., coefficient weight doubled)                                          27.40


4  Analysing  the  improvement 

The  previous  section  has  established  that  zero-crossing  information  can  produce  an  improvement 
in recognition accuracy. The purpose of the present section is to examine in more detail the process
by  which  zero-crossing  information  produces  an  improvement  in  recognition  scores.  Two  questions 
will  be  dealt  with:  In  which  part  of  the  utterance  is  the  new,  discriminative  information  present? 
How  does  zero-crossing  information  interact  with  the  warping  process? 

4.1  The  locus  of  useful  zero-crossing  information 

The  addition  of  zero-crossing  information  was  hypothesized  to  enhance  the  discriminability  of 
fricative  and  voiced  portions  of  utterances.  The  utterances  in  the  "E-set"  differ  only  in  the  initial 
portion  of  the  utterance  and  so  we  should  expect  that  the  improvement  in  performance  is  due  to 
extra  information  at  the  beginning  of  the  utterance.  It  is  also  possible,  however,  that  for  some  reason 
utterances  will  differ  in  zero-crossing  count  not  only  at  the  beginning,  but  throughout  the  utterance. 
If  this  is  the  case,  then  we  would  be  dealing  with  a  qualitatively  different  phonetic  difference  than 
the  one  we  were  originally  trying  to  represent  (the  fricative/non-fricative  distinction).  To  assure  that 
the  improvement  in  performance  was  due  to  better  discriminability  based  on  the  initial  portion  of 
the utterance, we performed the following experiment: All utterances were divided in half and a
recognition run was done separately on the first and second halves. If we are dealing with a
whole-utterance phenomenon, then both halves should show some improvement in performance
once zero-crossing information is added. If the additional information is present only at the
beginning of the utterance, then only the first half of the utterance should show the improvement.
For this experiment, the previously determined optimal settings for the zero-crossing were used, i.e.,
Floor: 20, RangeFactor: 4. The results are shown in Table 7.

Table 7: Error rates for whole, 1st half, and 2nd half utterances

                      whole   1st half   2nd half
FreSt scale only      28.01    24.29      73.27
FreSt scale and z-c   23.38    21.75      73.66
improvement            4.63     2.54      -0.39

As  can  be  seen,  the  results  support  the  hypothesis:  Recognition  based  on  only  the  second  half  of  the 
utterance  is  equally  poor,  with  or  without  zero-crossing  information.  In  contrast,  zero-crossing 
information  improves  performance  for  recognitions  based  on  only  the  first  half  of  the  utterance. 
Two other interesting points should be noted about the data: first, there is an overall improvement in
performance when only the first half is used; second, the improvement on first halves (with or
without zero-crossing information) is less than the improvement obtained by adding zero-crossing
information to the whole utterance. The improvement found when using only the first half of an
utterance can be attributed to the elimination of mismatches caused by spurious differences at the
ends of utterances. We believe these differences arise from instabilities in the speech signal that
occur when phonation ceases.⁴ In a template matching system, the resulting difference between
reference  and  test  utterances  cannot  be  distinguished  from  meaningful  differences,  such  as  those 
which  occur  at  the  beginning  of  the  utterances. 

If, for a given unknown utterance, class membership (e.g., in the E-set) can be reliably
determined, then it might be possible to improve recognition by focussing the recognition matching
on the most informative portions of the utterance (see, e.g., [Niimi 80] and [Bradshaw 82]). The
second result of interest, the relatively smaller improvement for the first half when zero-crossing
information is added, points to the asymptotic nature of most of the improvements which are added
(see [Rudnicky 82] for further discussion of this point).

⁴ In human speech perception, the information necessary for identifying an utterance must presumably have been extracted
before the end of the utterance has been reached: instabilities of this kind do not appear to influence the percept produced.

4.2 The influence of zero-crossing information on matching behaviour

The  matching  process  in  isolated  word  recognition  can  usually  be  factored  into  two  conceptually 
distinct  aspects:  1)  the  calculation  of  an  optimal  time  alignment  between  an  unknown  utterance  and 
a reference (i.e., establishing a warping path), 2) the calculation of a global "distance" between the
two utterances, typically computed from inter-frame distances along the warping path. In practice,
however, these two factors are confounded since the same measure, inter-frame distance, is used for
both  the  choice  of  a  warping  path  and  for  the  calculation  of  distances.  In  the  context  of  the  present 
study,  it  is  of  interest  to  find  out  whether  the  increase  in  recognition  accuracy  is  due  to  the 
development  of  a  better  warping  path  or  to  the  enhancement  of  the  score  calculated  along  an 
existing  path,  or  to  a  combination  of  the  two. 


With  this  in  mind,  an  experiment  was  designed  to  separate  the  path-formation  and  distance 
formation  aspects  of  the  zero-crossing  coefficient.  The  following  four  conditions  were  used: 

1. Zero-crossing coefficient used for both path formation and distance calculation. (This is
identical to the configuration used for the experiments in section 3.)

2. Zero-crossing coefficient used for neither path nor distance.

3. Zero-crossing information used in conjunction with spectral information to determine
best path. Distance calculated along the path without using the zero-crossing coefficient.

4. Spectral information alone used to determine the best path. Distance calculated with
both spectrum and zero-crossing coefficients.
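The four conditions can be expressed with a single warping routine that takes two switches, one controlling whether the zero-crossing coefficient enters path formation and one controlling whether it enters the final distance. The following is a sketch only, not the CICADA2 matcher: the city-block frame distance, the unconstrained three-way local path, and all names are assumptions. Each frame is represented as a pair (spectral coefficients, zero-crossing coefficient).

```python
def frame_distance(a, b, use_zc):
    """City-block distance between two frames (an assumed metric)."""
    spec_a, zc_a = a
    spec_b, zc_b = b
    d = sum(abs(x - y) for x, y in zip(spec_a, spec_b))
    if use_zc:
        d += abs(zc_a - zc_b)
    return d


def dtw_score(test, ref, zc_in_path, zc_in_distance):
    """Align test to ref with one frame distance, score with another."""
    n, m = len(test), len(ref)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]    # accumulated path-forming cost
    back = [[None] * m for _ in range(n)]   # predecessor of each cell
    for i in range(n):
        for j in range(m):
            d = frame_distance(test[i], ref[j], zc_in_path)
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best, prev = min(candidates)
            cost[i][j] = best + d
            back[i][j] = prev
    # Trace the chosen path back and accumulate the reported distance.
    score = 0
    cell = (n - 1, m - 1)
    while cell is not None:
        i, j = cell
        score += frame_distance(test[i], ref[j], zc_in_distance)
        cell = back[i][j]
    return score
```

In this sketch condition 1 corresponds to `dtw_score(t, r, True, True)`, condition 2 to `(False, False)`, condition 3 to `(True, False)` and condition 4 to `(False, True)`.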

The  results  of  the  experiment  are  presented  in  Table  8.  It  is  quite  apparent  that  zero-crossing 
information  contributes  only  to  the  distance  calculation  component  of  the  matching  process  and  not 
to the path-establishing component. There is no interaction between the two.


We can draw the following conclusions from this experiment. Time alignment based on spectral
information alone produces what appears to be an optimal alignment between a test and reference
utterance, at least one which cannot be improved upon by the addition of zero-crossing information.
This  result  suggests  that  a  fruitful  approach  to  template  matching  might  be  the  use  of  spectrum- 
based  warping  to  align  two  utterances,  followed  by  the  use  of  this  alignment  to  perform  more 
detailed  spectrum  and  feature-based  distance  computation  over  corresponding  time  frames  (along 
the  warping  path)  of  the  unknown  and  reference  utterances.  It  also  suggests  that  a  promising  avenue 
of  exploration  might  be  to  determine  the  minimal  information  needed  to  produce  optimal 
alignment,  thereby  freeing  a  system’s  resources  to  allow  them  to  be  concentrated  on  post-alignment 
processing. 


5  Summary 

The  experiments  described  in  this  paper  have  shown  that  zero-crossing  information  can  be 
successfully  used  to  augment  a  spectral  representation  in  a  template-matching  speech  recognition 
system. The reduction in error rate, over 8 talkers, is between 10 and 13%, depending on the spectral
mapping. Taking into account that the specific parameter values presented in this paper are in all
likelihood not generalizable beyond the present recognition system, the alpha-digit vocabulary, and
the  set  of  talkers,  we  can  offer  the  following  guidelines  for  the  use  of  zero-crossing  information: 

• Effective use of zero-crossing information requires the elimination of "noise". This can
be done by defining an "active" range for the count using floor and ceiling values. Once
defined, this range can be effective even if quantized to a small number of levels.

• A promising strategy for template-based recognition might be to separate time alignment
and distance score computation. Simple spectral matching can be used to perform time
alignment, while computationally more expensive discriminative techniques (based on
feature extraction and phonetic knowledge) can be used to calculate the distance score.